By James Brunner and Vyoma Jani
As computer science majors, we hear lots of stories about people working their dream jobs in Silicon Valley. Top companies and high pay definitely seem appealing! But we've also heard that the Bay Area is a very expensive place to live, so is it really worth moving there for the high pay?
This tutorial is going to explore the salaries, total years of experience, and other factors for a number of computer scientists across the country. Then, it will look into the cost of living in each of these locations, before determining what cities are best to live in and which ones may be better to avoid. We'll be going through the entire data science pipeline, from gathering the data and organizing it, to performing exploratory data analysis and machine learning on the information collected!
The first step in our process is to scour the web in search of relevant data. First, we will be looking at salary data from Levels.fyi. This dataset contains information about employees from various companies, levels, roles, locations, and paygrades, all collected from users who voluntarily provided their information.
To read in the dataset, we will first send an HTTP GET request using the requests library, and then use pandas, a Python library popular for data manipulation and analysis, to convert the JSON response into a dataframe.
import pandas as pd
import requests
salary_url = "https://www.levels.fyi/js/salaryData.json"
salary_data = requests.get(salary_url).json()
salaries = pd.DataFrame(salary_data)
salaries.head()
| | timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | tag | basesalary | stockgrantvalue | bonus | gender | otherdetails | cityid | dmaid | rowNumber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6/7/2017 11:33:27 | Oracle | L3 | Product Manager | 127 | Redwood City, CA | 1.5 | 1.5 | | 107 | 20 | 10 | | | 7392 | 807 | 1 |
| 1 | 6/10/2017 17:11:29 | eBay | SE 2 | Software Engineer | 100 | San Francisco, CA | 5 | 3 | | | | | | | 7419 | 807 | 2 |
| 2 | 6/11/2017 14:53:57 | Amazon | L7 | Product Manager | 310 | Seattle, WA | 8 | 0 | | 155 | | | | | 11527 | 819 | 3 |
| 3 | 6/14/2017 21:22:25 | Microsoft | 64 | Software Engineering Manager | 200 | Redmond, WA | 9 | 9 | | 169000 | 100000 | 30000 | | | 11521 | 819 | 5 |
| 4 | 6/16/2017 10:44:01 | Amazon | L5 | Software Engineer | 173 | Vancouver, BC, Canada | 11 | 1 | | 120000 | 0 | 53000 | | | 1320 | 0 | 6 |
We will also be looking at a dataset from Numbeo.com, which provides information on the cost of living in various US cities, breaking the total index down into subdivisions like a rent index or a groceries index. This information is similarly collected from users who volunteer it.
Like before, we will use pandas and the requests library, but we will also need the BeautifulSoup library to parse the HTML from the request before loading it into a dataframe.
from bs4 import BeautifulSoup
import requests
import pandas as pd
numbeo_url = 'https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2020-mid&region=021'
numbeo_request = requests.get(numbeo_url)
root = BeautifulSoup(numbeo_request.content, 'html.parser')
table = root.find("table", id = "t2")
numbeo = pd.read_html(table.prettify())[0]
numbeo.drop("Rank", axis = 1, inplace = True)
# match the location format used in the salaries dataset
numbeo["City"] = numbeo["City"].map(lambda x: x.replace(", United States", ""))
numbeo.head()
| City | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index | |
|---|---|---|---|---|---|---|---|
| 0 | New York, NY | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 1 | San Francisco, CA | 92.13 | 109.76 | 100.72 | 89.79 | 88.26 | 139.00 |
| 2 | Anchorage, AK | 90.83 | 36.51 | 64.38 | 89.86 | 75.91 | 119.91 |
| 3 | Oakland, CA | 88.68 | 79.83 | 84.37 | 94.92 | 72.03 | 97.98 |
| 4 | Boston, MA | 88.61 | 75.13 | 82.05 | 89.28 | 84.34 | 107.59 |
All of the Numbeo indices are relative to New York City, meaning New York City will have 100% for each index. If a city has, for example, index 120 for a given category, the city is on average 20% more expensive for that category than New York City.
The Numbeo indices signify the following (paraphrased from the Numbeo website): the Cost of Living Index tracks prices of consumer goods such as groceries, restaurant meals, transportation, and utilities, excluding rent; the Rent Index compares typical apartment rents; the Cost of Living Plus Rent Index combines the two; the Groceries Index and Restaurant Price Index isolate those categories; and the Local Purchasing Power Index measures how much the average local net salary can buy.
Note that the information from Numbeo is from 2020, whereas the information from Levels.fyi is from 2017 to 2020. While we will not be analyzing the data over time, the year may have an effect on the other variables in the data.
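As a concrete reading of these relative indices, here is a tiny sketch; the helper `relative_cost` is our own illustration, not a Numbeo API:

```python
def relative_cost(index_value, nyc_price):
    """Estimate a city's price for a category from a Numbeo-style index,
    where New York City is pinned at 100."""
    return nyc_price * index_value / 100.0

# A city with index 120 is ~20% more expensive than NYC,
# so a $3,000/month NYC rent corresponds to about $3,600 there:
print(relative_cost(120, 3000))    # 3600.0
# San Francisco's rent index of 109.76 implies roughly $3,292.80:
print(relative_cost(109.76, 3000))
```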
Now that we have gathered our two datasets, we will tidy them up to make analysis easier. For the Levels.fyi data, we will again need pandas, and we also import the datetime module to help us convert the timestamp strings into datetime objects, which standardizes the time format.
For the Levels.fyi dataset, we observe that many of the numeric columns, like totalyearlycompensation, are stored as strings. Because we will be interpreting these values as numbers, we convert those columns to numeric types.
Also, because we are computer science majors, we're only focusing on computer science jobs!
# note: some compensation values appear to mix units (dollars vs. thousands); see the outlier discussion below
import pandas as pd
from datetime import datetime
# Remove unwanted columns
salaries.drop(["tag", "gender", "otherdetails", "cityid", "dmaid", "rowNumber"], axis = 1, inplace = True)
# Filter out non-tech positions
to_remove = ["Marketing", "Sales", "Recruiter", "Hardware Engineer", "Management Consultant", "Business Analyst"] # non-CS titles to drop; more could be added
salaries = salaries[~salaries["title"].isin(to_remove)]
# parse timestamps into datetime objects
for index, row in salaries.iterrows():
    salaries.at[index, "timestamp"] = datetime.strptime(row["timestamp"], "%m/%d/%Y %H:%M:%S")
# parsing salary and compensation as numbers
salaries["totalyearlycompensation"] = pd.to_numeric(salaries["totalyearlycompensation"])
salaries["yearsofexperience"] = pd.to_numeric(salaries["yearsofexperience"])
salaries["yearsatcompany"] = pd.to_numeric(salaries["yearsatcompany"])
salaries["basesalary"] = pd.to_numeric(salaries["basesalary"])
salaries["stockgrantvalue"] = pd.to_numeric(salaries["stockgrantvalue"])
salaries["bonus"] = pd.to_numeric(salaries["bonus"])
salaries.head()
| timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | basesalary | stockgrantvalue | bonus | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-07 11:33:27 | Oracle | L3 | Product Manager | 127.0 | Redwood City, CA | 1.5 | 1.5 | 107.0 | 20.0 | 10.0 |
| 1 | 2017-06-10 17:11:29 | eBay | SE 2 | Software Engineer | 100.0 | San Francisco, CA | 5.0 | 3.0 | NaN | NaN | NaN |
| 2 | 2017-06-11 14:53:57 | Amazon | L7 | Product Manager | 310.0 | Seattle, WA | 8.0 | 0.0 | 155.0 | NaN | NaN |
| 3 | 2017-06-14 21:22:25 | Microsoft | 64 | Software Engineering Manager | 200.0 | Redmond, WA | 9.0 | 9.0 | 169000.0 | 100000.0 | 30000.0 |
| 4 | 2017-06-16 10:44:01 | Amazon | L5 | Software Engineer | 173.0 | Vancouver, BC, Canada | 11.0 | 1.0 | 120000.0 | 0.0 | 53000.0 |
Before moving into exploratory data analysis, it is important to check our data for outliers and for any values that may have errors or may have been computed incorrectly.
salaries[["totalyearlycompensation", "basesalary", "stockgrantvalue", "bonus"]].describe()
| | totalyearlycompensation | basesalary | stockgrantvalue | bonus |
|---|---|---|---|---|
| count | 34388.000000 | 3.217700e+04 | 3.181500e+04 | 30522.000000 |
| mean | 230.091762 | 3.482667e+03 | 2.742548e+03 | 492.125667 |
| std | 143.162435 | 2.837062e+04 | 7.593620e+04 | 5099.747516 |
| min | 10.000000 | 1.000000e+00 | 0.000000e+00 | 0.000000 |
| 25% | 150.000000 | 1.200000e+02 | 9.000000e+00 | 7.000000 |
| 50% | 200.000000 | 1.460000e+02 | 3.400000e+01 | 17.000000 |
| 75% | 277.000000 | 1.750000e+02 | 8.050000e+01 | 30.000000 |
| max | 5000.000000 | 2.000000e+06 | 1.200000e+07 | 280000.000000 |
From these summary statistics, we can observe some serious outliers. The mean base salary is about 3,483 in a column nominally recorded in thousands of dollars, wildly above the median of 146, which indicates faulty values skewing the data to the right. The same appears to be true for stock grant value and bonus. We can also see how extreme the outliers get by comparing each column's max to its 75th percentile. A likely culprit is that some users entered raw dollar amounts, not realizing the site already counts in thousands.
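If the mixed-units hypothesis is right (some users entering raw dollars where the site expects thousands), one plausible cleanup is to rescale only the suspect rows. Below is a hedged sketch, assuming any value above a cutoff such as 10,000 must have been typed in dollars; the helper name `normalize_to_thousands` and the cutoff are our own illustration, not part of the original pipeline:

```python
import pandas as pd

def normalize_to_thousands(series, cutoff = 10_000):
    # values at or above the cutoff are assumed to be raw dollars;
    # divide them by 1,000 so every row is expressed in thousands
    return series.where(series < cutoff, series / 1000.0)

# e.g. the raw data reports base salaries of 107 (thousands) and 169000 (dollars)
mixed = pd.Series([107.0, 169000.0, 155.0, 120000.0])
print(normalize_to_thousands(mixed).tolist())  # [107.0, 169.0, 155.0, 120.0]
```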
We want to keep these outliers from skewing our analysis, so we will trim the top and bottom 2.5% of base salaries from our dataset. Note that trimming risks introducing bias, since it is possible that the trimmed salaries are extreme for a reason relevant to our analysis.
# ends of the data we will trim. 2.5% off of each end
lower = salaries["basesalary"].quantile(0.025)
upper = salaries["basesalary"].quantile(0.975)
salaries = salaries[(salaries["basesalary"] >= lower) & (salaries["basesalary"] <= upper)]
salaries.head()
| timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | basesalary | stockgrantvalue | bonus | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-07 11:33:27 | Oracle | L3 | Product Manager | 127.0 | Redwood City, CA | 1.5 | 1.5 | 107.0 | 20.0 | 10.0 |
| 2 | 2017-06-11 14:53:57 | Amazon | L7 | Product Manager | 310.0 | Seattle, WA | 8.0 | 0.0 | 155.0 | NaN | NaN |
| 5 | 2017-06-17 00:23:14 | Apple | M1 | Software Engineering Manager | 372.0 | Sunnyvale, CA | 7.0 | 5.0 | 157.0 | 180.0 | 35.0 |
| 9 | 2017-06-22 12:37:51 | Microsoft | 65 | Software Engineering Manager | 300.0 | Redmond, WA | 15.0 | 11.0 | 180.0 | 65.0 | 55.0 |
| 10 | 2017-06-22 13:55:26 | Microsoft | 62 | Software Engineer | 156.0 | Seattle, WA | 4.0 | 4.0 | 135.0 | 8.0 | 13.0 |
Our next step in data preparation involves filtering out all jobs that are not located in the United States. We can observe that in the location column of the salaries dataset, locations in the United States are formatted as "City, State", while other countries (such as Canada) are formatted as "City, State/Province, Country". Using Regular Expressions, we can easily filter out these rows.
import re
# US locations are formatted "City, ST" (two-letter state, no trailing country),
# while non-US rows end with a country name, so they will not match this pattern.
# Note: Washington D.C. is formatted "Washington, DC", so it is kept.
regex = r'^[a-z A-Z\.]+, [A-Z]{2}$'
# sanity check: we lose slightly over 3,000 rows, which is acceptable
print("Deleting Cities: ")
print(salaries[~salaries.location.str.contains(regex, regex = True, na = False)]["location"])
# filter
salaries = salaries[salaries.location.str.contains(regex, regex = True, na = False)]
salaries.head()
Deleting Cities:
48 Dublin, DN, Ireland
135 Cambridge, EN, United Kingdom
179 Bangalore, KA, India
225 Remote
355 Dublin, DN, Ireland
...
37351 London, EN, United Kingdom
37352 Berlin, BE, Germany
37374 Sydney, NS, Australia
37386 Amsterdam, NH, Netherlands
37389 Winchester, EN, United Kingdom
Name: location, Length: 3127, dtype: object
| timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | basesalary | stockgrantvalue | bonus | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-07 11:33:27 | Oracle | L3 | Product Manager | 127.0 | Redwood City, CA | 1.5 | 1.5 | 107.0 | 20.0 | 10.0 |
| 2 | 2017-06-11 14:53:57 | Amazon | L7 | Product Manager | 310.0 | Seattle, WA | 8.0 | 0.0 | 155.0 | NaN | NaN |
| 5 | 2017-06-17 00:23:14 | Apple | M1 | Software Engineering Manager | 372.0 | Sunnyvale, CA | 7.0 | 5.0 | 157.0 | 180.0 | 35.0 |
| 9 | 2017-06-22 12:37:51 | Microsoft | 65 | Software Engineering Manager | 300.0 | Redmond, WA | 15.0 | 11.0 | 180.0 | 65.0 | 55.0 |
| 10 | 2017-06-22 13:55:26 | Microsoft | 62 | Software Engineer | 156.0 | Seattle, WA | 4.0 | 4.0 | 135.0 | 8.0 | 13.0 |
Now that both datasets have been properly cleaned, the last step is to merge them together. One noticeable problem is that the Numbeo dataset does not contain all the same cities as the Levels.fyi dataset, and therefore we cannot make a simple join between the two.
Instead, since major cities have an influence on the economics of nearby towns, we will use an algorithm to join each Levels.fyi entry to a closest neighboring city that is in the Numbeo dataset, as long as they are within a reasonable distance from each other.
Step #1: Get the longitude and latitude coordinates in each dataset
Step #2: Match locations
We will also discard any rows that cannot match a city in our algorithm, as we consider those jobs too far from major U.S. cities and we are not interested in those for the purpose of this analysis.
For the algorithm, we will make use of the Geopy library, which offers tools to locate coordinates across the globe.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent = "Final Project")
# step one: get longitude and latitude coordinates in each dataset
def get_coordinates(df, col_name):
    # cache coordinates for cities we have seen before instead of re-querying
    found_cities = {}
    unfound_cities = set()
    longitude_list = []
    latitude_list = []
    for index, row in df.iterrows():
        if row[col_name] in found_cities:
            # we have found this city before and have its coordinates
            longitude_list.append(found_cities[row[col_name]][0])
            latitude_list.append(found_cities[row[col_name]][1])
        elif row[col_name] in unfound_cities:
            # we have already recorded that we can't find this city
            longitude_list.append(0)
            latitude_list.append(0)
        else:
            # the lookup may time out if it can't find the city, so wrap it in try/except
            try:
                result = geolocator.geocode(row[col_name])
            except:
                result = None
            if result is None:
                # new city we couldn't find
                print("Unfound City:", row[col_name])
                unfound_cities.add(row[col_name])
                longitude_list.append(0)
                latitude_list.append(0)
            else:
                # new city we found
                longitude_list.append(result.longitude)
                latitude_list.append(result.latitude)
                found_cities[row[col_name]] = (result.longitude, result.latitude)
    # return the coordinates plus the cities we couldn't find, so those rows can be removed
    return longitude_list, latitude_list, unfound_cities
# before we do this, we need to remove NA's from location in the salaries dataset
salaries.dropna(subset = ['location'], inplace = True)
salaries["longitude"], salaries["latitude"], cities_to_drop_salaries = get_coordinates(salaries, "location")
numbeo["longitude"], numbeo["latitude"], cities_to_drop_numbeo = get_coordinates(numbeo, "City")
salaries = salaries[~salaries["location"].isin(cities_to_drop_salaries)]
numbeo = numbeo[~numbeo["City"].isin(cities_to_drop_numbeo)]
salaries = salaries[salaries["longitude"] != 0]
numbeo = numbeo[numbeo["longitude"] != 0]
Unfound City: New York Mills, NY
Unfound City: O Fallon, MO
Unfound City: Ireland, IN
Unfound City: Brazil, IN
from geopy.distance import geodesic
# step 2: location matching algorithm
# for each salary row, find the closest Numbeo city and the distance to it;
# results are cached per city name so repeated locations are not recomputed.
# rows matched farther than 50 miles away will be dropped afterwards
def location_match(df):
    seen_cities = {}  # values are tuples: (closest city name, distance in miles)
    distance_list = []
    joining_city_list = []
    for index, row in df.iterrows():
        if row["location"] in seen_cities:
            joining_city_list.append(seen_cities[row["location"]][0])
            distance_list.append(seen_cities[row["location"]][1])
        else:
            # start with the first Numbeo city -- New York, NY
            curr_location = (row["latitude"], row["longitude"])
            comparing_location = (numbeo.at[0, "latitude"], numbeo.at[0, "longitude"])
            min_distance = geodesic(curr_location, comparing_location).miles
            min_city = "New York, NY"
            for index2, row2 in numbeo.iterrows():
                comparing_location = (row2["latitude"], row2["longitude"])
                distance = geodesic(curr_location, comparing_location).miles
                if distance < min_distance:
                    min_distance = distance
                    min_city = row2["City"]
            joining_city_list.append(min_city)
            distance_list.append(min_distance)
            seen_cities[row["location"]] = (min_city, min_distance)
    return joining_city_list, distance_list
salaries["joining_city"], salaries["distance"] = location_match(salaries)
salaries = salaries[salaries["distance"] <= 50]
# after merging, maybe write to a CSV to make it easy to load and do EDA without having to re-run all the steps above
df = salaries.merge(numbeo, how = 'left', left_on = "joining_city", right_on = "City")
df.head()
| timestamp | company | level | title | totalyearlycompensation | location | yearsofexperience | yearsatcompany | basesalary | stockgrantvalue | bonus | longitude_x | latitude_x | joining_city | distance | City | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index | longitude_y | latitude_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-07 11:33:27 | Oracle | L3 | Product Manager | 127.0 | Redwood City, CA | 1.5 | 1.5 | 107.0 | 20.0 | 10.0 | -122.232523 | 37.486324 | San Jose, CA | 21.471326 | San Jose, CA | 77.55 | 81.32 | 79.39 | 69.57 | 82.46 | 143.11 | -121.890583 | 37.336191 |
| 1 | 2017-06-11 14:53:57 | Amazon | L7 | Product Manager | 310.0 | Seattle, WA | 8.0 | 0.0 | 155.0 | NaN | NaN | -122.330062 | 47.603832 | Seattle, WA | 0.000000 | Seattle, WA | 87.99 | 64.00 | 76.31 | 80.83 | 86.82 | 133.18 | -122.330062 | 47.603832 |
| 2 | 2017-06-17 00:23:14 | Apple | M1 | Software Engineering Manager | 372.0 | Sunnyvale, CA | 7.0 | 5.0 | 157.0 | 180.0 | 35.0 | -122.036350 | 37.368830 | San Jose, CA | 8.334578 | San Jose, CA | 77.55 | 81.32 | 79.39 | 69.57 | 82.46 | 143.11 | -121.890583 | 37.336191 |
| 3 | 2017-06-22 12:37:51 | Microsoft | 65 | Software Engineering Manager | 300.0 | Redmond, WA | 15.0 | 11.0 | 180.0 | 65.0 | 55.0 | -122.123877 | 47.669414 | Seattle, WA | 10.640599 | Seattle, WA | 87.99 | 64.00 | 76.31 | 80.83 | 86.82 | 133.18 | -122.330062 | 47.603832 |
| 4 | 2017-06-22 13:55:26 | Microsoft | 62 | Software Engineer | 156.0 | Seattle, WA | 4.0 | 4.0 | 135.0 | 8.0 | 13.0 | -122.330062 | 47.603832 | Seattle, WA | 0.000000 | Seattle, WA | 87.99 | 64.00 | 76.31 | 80.83 | 86.82 | 133.18 | -122.330062 | 47.603832 |
# so we won't have to re-run all the long algorithms
df.to_csv("saved.csv")
# just run this if there were any unwanted changes to the merged df
import pandas as pd
df = pd.read_csv("saved.csv")
import sys
try:
    import plotly.express as px
except ImportError:
    !{sys.executable} -m pip install plotly_express
    import plotly.express as px
import seaborn as sns
import folium
# Remove warning messages
import warnings
warnings.filterwarnings('ignore')
First, we will examine the relationship between years of experience and base salary using a scatterplot. We want to determine whether the relationship is linear, so we will plot a line of best fit on top.
fig = px.scatter(df, x = "yearsofexperience", y = "basesalary", trendline = "ols",
labels = {
"basesalary": "Base Salary",
"yearsofexperience": "Years of Experience"
}, title = "Predicting Base Salary based on Years of Experience")
# fig.show() can be used for interactive charts, however these won't work on GitHub
# fig.show()
It looks like there's a positive linear relationship between years of experience and base salary!
However, does this relationship extend for all job positions? Or are there specific Computer Science roles that have a different relationship between base salary and years of experience?
Below, we plot the years of experience against the base salary of the 7 roles we are looking to study. We will create separate scatterplots for the roles, and will again plot a line of best fit to see if there is some type of linear relationship between the two for each role.
# does adding the years at company dimension help us notice anything?
# instead of trimming the outliers, we could rescale the y axis, but this is often considered bad practice
# currently plotting basesalary; to plot totalyearlycompensation instead,
# use the commented-out line below in place of the line after it
# fig = px.scatter(df[df["totalyearlycompensation"] <= 2000], x = "yearsofexperience", y = "totalyearlycompensation",
fig = px.scatter(df, x = "yearsofexperience", y = "basesalary",
color = "yearsatcompany", color_continuous_scale = "rainbow", facet_col = "title", facet_col_wrap = 4, trendline = "ols",
labels = {
"basesalary": "Base Salary",
"yearsofexperience": "Years of Experience",
"facet_col": "title"
}, title = "Predicting Base Salary based on Years of Experience")
# fig.show()
Based on the above plots, we can see a positive correlation between years of experience and base salary for each role. It looks like the base salary increases relatively slowly over years of experience for roles like Software Engineering Manager, compared to roles like Software Engineer.
Now we want to narrow our focus to specific cities and companies. Below, we calculate the average base salary per US city and per company in our dataset.
# plot average salary in each city
mean_df = df[["basesalary", "City"]].groupby("City").mean().reset_index()
count_df = df[["City"]].value_counts().reset_index()
mean_df.rename(columns = {"basesalary": "basesalary_mean"}, inplace = True)
count_df.rename(columns = {0: "count"}, inplace = True)
summary_df = pd.merge(mean_df, count_df, how = "inner", left_on = "City", right_on = "City")
summary_df = pd.merge(summary_df, df, how = "left", left_on = "City", right_on = "City")
wanted_columns = ["City", "longitude_y", "latitude_y", "basesalary_mean", "count", "Cost of Living Plus Rent Index"]
summary_df = summary_df[wanted_columns]
summary_df = summary_df.drop_duplicates()
# we will also remove cities with less than 30 observations, as we want to have a decent sized sample from each city
summary_df = summary_df[summary_df["count"] >= 30]
summary_df.head()
| City | longitude_y | latitude_y | basesalary_mean | count | Cost of Living Plus Rent Index | |
|---|---|---|---|---|---|---|
| 18 | Atlanta, GA | -84.390264 | 33.748992 | 122.925000 | 200 | 61.88 |
| 218 | Austin, TX | -97.743700 | 30.271129 | 134.277603 | 634 | 56.81 |
| 891 | Boston, MA | -71.058291 | 42.360253 | 137.592269 | 802 | 82.05 |
| 1698 | Charlotte, NC | -80.843083 | 35.227209 | 128.354167 | 48 | 59.39 |
| 1746 | Chicago, IL | -87.624421 | 41.875562 | 129.244966 | 298 | 69.30 |
company_df = df[["basesalary", "company"]].groupby("company").mean().reset_index()
company_count_df = df[["company"]].value_counts().reset_index()
company_df = pd.merge(company_df, company_count_df, how = "inner", left_on = "company", right_on = "company")
company_df.rename(columns = {"basesalary": "basesalary_mean"}, inplace = True)
company_df.rename(columns = {0: "count"}, inplace = True)
# we will also remove companies with less than 30 observations, as we want to have a decent sized sample for companies we analyze
company_df = company_df[company_df["count"] >= 30]
company_df.head()
| company | basesalary_mean | count | |
|---|---|---|---|
| 14 | AT&T | 116.470588 | 34 |
| 16 | Accenture | 121.645570 | 79 |
| 20 | Adobe | 155.603550 | 169 |
| 24 | Airbnb | 191.426752 | 157 |
| 34 | Amazon | 144.978677 | 3658 |
Now that we have calculated average base salaries, let's plot them!
First, we'll plot a bar chart of the 20 highest-paying companies, based on average base salary.
# Top 20 companies
fig = px.bar(company_df.nlargest(20, "basesalary_mean"), x = 'company', y = 'basesalary_mean',
labels = {
'basesalary_mean': 'Average Base Salary',
'company': 'Company'
}, height = 400, title = "Average Base Salary of Top 20 Highest Paying Companies")
# fig.show()
Based on this figure, it seems that, on average, Netflix pays much more than every other company in the dataset, offering nearly double what the next highest paying company offers. The rest of the top 20 companies tend to pay around $200,000 in base salary on average.
Now let's take a look at what locations have the highest paying jobs. We will create a map that displays different colored circles on cities depending on their average salary to see where the higher paying jobs are located.
# color-code cities by average base salary (in thousands)
def color(salary):
    if salary < 75:
        return "blue"
    elif salary < 95:
        return "green"
    elif salary < 115:
        return "red"
    elif salary < 135:
        return "orange"
    else:
        return "black"
map = folium.Map(width = 800, height = 500, location = [40, -98], zoom_start = 4)
for i in range(0, len(summary_df)):
    folium.Circle(
        location = [summary_df.iloc[i]["latitude_y"], summary_df.iloc[i]["longitude_y"]],
        radius = 100000,
        color = color(summary_df.iloc[i]["basesalary_mean"]),
        tooltip = "City: " + str(summary_df.iloc[i]["City"]) + ", " + "Average Base Salary: " + str(summary_df.iloc[i]["basesalary_mean"])
    ).add_to(map)
map
As we can see from the black circles, the highest paying computer science jobs are on the coasts. However, there are still relatively high paying jobs in non-coastal areas like Pittsburgh or Austin, as shown by the orange circles.
As we can see, most of the highest paying computer science jobs are on the coasts, with the exception of a few cities like Colorado Springs. To get a better sense of the relationship between years of experience and base salary in the 10 highest paying locations, we again plot a matrix of scatterplots below. It will also be useful to see what the cost of living is like in these highest paying regions.
desired_cities = summary_df.nlargest(10, "basesalary_mean").City.tolist()
fig = px.scatter(df[df["City"].isin(desired_cities)], x = "yearsofexperience", y = "basesalary", trendline = "ols",
facet_col = "City", facet_col_wrap = 5,
labels = {
"yearsofexperience": "Years of Experience",
"basesalary": "Base Salary"
}, title = "Predicting Base Salary over Years of Experience by Location")
# fig.show()
It seems that all the cities have positive linear relationships between years of experience and base salary. However, some cities, like San Francisco, have a larger increase in base salary as experience increases, as can be seen by the steeper slope in those graphs. Meanwhile, Pittsburgh and Irvine increase at slower rates. So far, based on salary alone, it seems like the Bay Area is a great place to live, and it offers high growth opportunities for employees. But is the salary worth the cost?
Now, we turn our attention to the cost of living as it relates to salaries, locations, and other factors. First, we will compare base salary to the cost of living index of each location using another scatterplot.
fig = px.scatter(df, x = "basesalary", y = "Cost of Living Index", trendline = "ols",
labels = {
"basesalary": "Base Salary",
}, title = "Predicting Cost of Living Index based on Base Salary")
# fig.show()
It's a little hard to tell if the above line of best fit actually represents the data well, since the points are all over the place. Thus, we'll create a matrix of correlations between base salary and the different types of cost of living indices to see if there's any more specific linear relationships.
cols = ["basesalary", "Cost of Living Index", "Rent Index", "Cost of Living Plus Rent Index", "Groceries Index", "Restaurant Price Index", "Local Purchasing Power Index"]
sns.heatmap(df[cols].corr(), annot = True, fmt = '.1g', vmin=-1, vmax=1, center= 0, cmap= 'RdYlGn').set_title("Correlation Matrix of Base Salary and Numbeo Cost of Living Indices")
Text(0.5, 1.0, 'Correlation Matrix of Base Salary and Numbeo Cost of Living Indices')
The above correlation matrix indicates that we have a slight positive relationship between base salary and every Numbeo index. The strongest relationships base salary has appear to be with Rent Index and Cost of Living Plus Rent Index, with correlations of 0.3. We'll focus on just the cost of living plus rent index.
We've seen which cities pay the most, and we've seen the positive correlation between the cost of living plus rent index and base salaries. Now, we'll combine these two pieces of information in the bar chart below, which graphs the cost of living plus rent index of the 20 highest paying cities.
# Top 20 cities
fig = px.bar(summary_df.nlargest(20, "basesalary_mean"), x = 'City', y = 'Cost of Living Plus Rent Index',
color = 'basesalary_mean', labels = {
'basesalary_mean':'Average Base Salary'
}, height = 400, title = "Cost of Living Plus Rent Index for Top 20 Highest Paying Cities")
# fig.show()
# Why is San Francisco so much more expensive compared to Oakland and San Jose ?
# Maybe it could be the number of companies there and population?
# We could easily run some aggregate operations to get a count of companies, but population may be annoying
This plot highlights certain cities as relatively expensive to live in compared to cities that pay similarly. San Francisco pays the most, but it is also extremely expensive to live in. Cities like New York, Boston, and Washington DC show peaks in the graph, indicating they are relatively expensive given their average salaries. Meanwhile, Pittsburgh and Little Rock, for example, appear much cheaper relative to other cities with similar average salaries, as we can see from the dips in the chart at those cities.
Now that we have analyzed the data, we can begin using the analysis to make predictions about a person's salary based on their role, years of experience, and location. We will be performing a multiple linear regression analysis to predict salary.
Learn more about multiple linear regression here: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/mlr-with-interactions.html
Below, we use the statsmodels library to perform an ordinary least squares regression with multiple predictors: job title, years of experience, cost of living, and city.
from statsmodels.formula.api import ols

# These are the cities with an adequate number of observations (>= 30)
cities = summary_df.City.tolist()

# An interaction term might be worth exploring, since years of experience and title are likely related
model = ols(formula = "basesalary ~ title + yearsofexperience + Q('Cost of Living Plus Rent Index') + City", data = df[df["City"].isin(cities)])
reg = model.fit()
reg.summary()
| Dep. Variable: | basesalary | R-squared: | 0.472 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.471 |
| Method: | Least Squares | F-statistic: | 592.5 |
| Date: | Mon, 21 Dec 2020 | Prob (F-statistic): | 0.00 |
| Time: | 17:46:32 | Log-Likelihood: | -1.2833e+05 |
| No. Observations: | 26580 | AIC: | 2.567e+05 |
| Df Residuals: | 26539 | BIC: | 2.571e+05 |
| Df Model: | 40 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 19.1947 | 2.785 | 6.891 | 0.000 | 13.735 | 24.654 |
| title[T.Product Designer] | -13.2945 | 1.410 | -9.427 | 0.000 | -16.059 | -10.530 |
| title[T.Product Manager] | -4.9346 | 1.096 | -4.501 | 0.000 | -7.084 | -2.786 |
| title[T.Software Engineer] | -5.5953 | 0.909 | -6.158 | 0.000 | -7.376 | -3.814 |
| title[T.Software Engineering Manager] | 9.0033 | 1.190 | 7.564 | 0.000 | 6.670 | 11.336 |
| title[T.Solution Architect] | -11.5910 | 1.714 | -6.762 | 0.000 | -14.951 | -8.231 |
| title[T.Technical Program Manager] | -16.2326 | 1.559 | -10.412 | 0.000 | -19.288 | -13.177 |
| City[T.Austin, TX] | 15.4769 | 2.485 | 6.228 | 0.000 | 10.606 | 20.347 |
| City[T.Boston, MA] | -9.9083 | 2.286 | -4.335 | 0.000 | -14.388 | -5.428 |
| City[T.Charlotte, NC] | 7.3710 | 4.870 | 1.514 | 0.130 | -2.175 | 16.917 |
| City[T.Chicago, IL] | -0.4012 | 2.728 | -0.147 | 0.883 | -5.747 | 4.945 |
| City[T.Columbus, OH] | 3.9350 | 5.231 | 0.752 | 0.452 | -6.318 | 14.188 |
| City[T.Dallas, TX] | 6.2463 | 2.689 | 2.323 | 0.020 | 0.976 | 11.517 |
| City[T.Denver, CO] | 4.9774 | 3.438 | 1.448 | 0.148 | -1.762 | 11.716 |
| City[T.Detroit, MI] | 2.0776 | 5.139 | 0.404 | 0.686 | -7.996 | 12.151 |
| City[T.Fort Lauderdale, FL] | -8.7460 | 5.587 | -1.566 | 0.117 | -19.696 | 2.204 |
| City[T.Houston, TX] | 5.7878 | 3.723 | 1.555 | 0.120 | -1.509 | 13.084 |
| City[T.Irvine, CA] | -6.7259 | 3.252 | -2.068 | 0.039 | -13.101 | -0.351 |
| City[T.Kansas City, MO] | -7.8660 | 4.347 | -1.809 | 0.070 | -16.387 | 0.655 |
| City[T.Little Rock, AR] | 29.5336 | 5.352 | 5.519 | 0.000 | 19.044 | 40.023 |
| City[T.Los Angeles, CA] | 16.4891 | 2.512 | 6.565 | 0.000 | 11.566 | 21.412 |
| City[T.Madison, WI] | 20.7144 | 4.643 | 4.462 | 0.000 | 11.614 | 29.814 |
| City[T.Minneapolis, MN] | 0.6337 | 3.462 | 0.183 | 0.855 | -6.153 | 7.420 |
| City[T.New York, NY] | -9.1110 | 2.075 | -4.391 | 0.000 | -13.178 | -5.044 |
| City[T.Oakland, CA] | 7.0078 | 3.035 | 2.309 | 0.021 | 1.059 | 12.956 |
| City[T.Orlando, FL] | -12.5407 | 5.762 | -2.176 | 0.030 | -23.835 | -1.247 |
| City[T.Philadelphia, PA] | -1.0650 | 3.309 | -0.322 | 0.748 | -7.551 | 5.421 |
| City[T.Phoenix, AZ] | 8.0489 | 3.318 | 2.426 | 0.015 | 1.545 | 14.553 |
| City[T.Pittsburgh, PA] | 26.0356 | 3.640 | 7.154 | 0.000 | 18.902 | 33.169 |
| City[T.Portland, OR] | 3.8188 | 2.903 | 1.315 | 0.188 | -1.872 | 9.510 |
| City[T.Raleigh, NC] | 11.3658 | 3.492 | 3.255 | 0.001 | 4.522 | 18.209 |
| City[T.Richmond, VA] | 14.6142 | 4.859 | 3.008 | 0.003 | 5.090 | 24.138 |
| City[T.Sacramento, CA] | -4.5832 | 4.785 | -0.958 | 0.338 | -13.963 | 4.796 |
| City[T.Salt Lake City, UT] | 10.8487 | 3.887 | 2.791 | 0.005 | 3.230 | 18.468 |
| City[T.San Antonio, TX] | -3.2456 | 5.178 | -0.627 | 0.531 | -13.394 | 6.903 |
| City[T.San Diego, CA] | -2.0064 | 2.532 | -0.792 | 0.428 | -6.969 | 2.956 |
| City[T.San Francisco, CA] | 1.4795 | 2.048 | 0.722 | 0.470 | -2.535 | 5.494 |
| City[T.San Jose, CA] | 23.4459 | 2.082 | 11.262 | 0.000 | 19.365 | 27.526 |
| City[T.Seattle, WA] | 9.1111 | 2.094 | 4.352 | 0.000 | 5.007 | 13.215 |
| City[T.Washington, DC] | -11.8550 | 2.602 | -4.556 | 0.000 | -16.955 | -6.755 |
| yearsofexperience | 3.8427 | 0.034 | 113.500 | 0.000 | 3.776 | 3.909 |
| Q('Cost of Living Plus Rent Index') | 1.2979 | 0.018 | 70.575 | 0.000 | 1.262 | 1.334 |
| Omnibus: | 9385.385 | Durbin-Watson: | 1.947 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 92689.084 |
| Skew: | 1.414 | Prob(JB): | 0.00 |
| Kurtosis: | 11.700 | Cond. No. | 6.29e+16 |
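To make the coefficient table above more concrete: the `title[T.…]` and `City[T.…]` rows are treatment-coded indicator columns, with the alphabetically first level of each categorical absorbed into the intercept. A minimal sketch of that encoding on toy data (all numbers hypothetical), fitting the same least-squares problem directly with numpy:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical numbers): salary driven by title and experience
toy = pd.DataFrame({
    "title": ["Data Scientist", "Software Engineer", "Software Engineer", "Data Scientist"],
    "yearsofexperience": [0, 0, 5, 5],
    "basesalary": [100, 110, 130, 120],
})

# Treatment-code the categorical, as the formula API's title[T.…] terms do:
# drop the first level ("Data Scientist") as the baseline
X = pd.get_dummies(toy[["title", "yearsofexperience"]], columns=["title"],
                   drop_first=True, dtype=float)
X.insert(0, "Intercept", 1.0)

# Ordinary least squares fit via lstsq
beta, *_ = np.linalg.lstsq(X.to_numpy(), toy["basesalary"].to_numpy(), rcond=None)
print(dict(zip(X.columns, beta.round(4))))
```

Each dummy coefficient is the expected salary difference relative to the baseline level, holding the other predictors fixed, which is how the title and City rows in the summary above should be read.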
Now that we have the fitted parameters, we can use them to predict salaries!
# Algorithm to rank top cities for you based on your qualifications and desired job
# These are the cities with an adequate number of observations (>= 30)
cities = summary_df.City.tolist()

# 10-fold cross-validation could be added here to validate the model
def rank_cities(job_title, years_of_experience):
    # Returns a pandas dataframe with: Job Title, Expected Salary, City, and Cost of Living,
    # sorted by expected salary. The same output can later be used to compute bang-for-your-buck.
    result = pd.DataFrame(columns = ["title", "expectedbasesalary", "city", "costofliving"])
    numbeo_with_desired_cities = numbeo[numbeo["City"].isin(cities)]
    # Iterate the rows of our desired cities to extract cost of living
    for index, row in numbeo_with_desired_cities.iterrows():
        params = pd.DataFrame.from_dict({
            "title": [job_title],
            "yearsofexperience": [years_of_experience],
            "Cost of Living Plus Rent Index": [row["Cost of Living Plus Rent Index"]],
            "City": [row["City"]]
        })
        prediction = reg.predict(params)
        result.loc[len(result)] = [job_title, prediction[0], row["City"], row["Cost of Living Plus Rent Index"]]
    return result.sort_values(by = ["expectedbasesalary"], ascending = False)
test = rank_cities("Software Engineer", 0)
test.head()
| | title | expectedbasesalary | city | costofliving |
|---|---|---|---|---|
| 1 | Software Engineer | 153.492551 | San Francisco, CA | 100.72 |
| 16 | Software Engineer | 147.773951 | San Jose, CA | 79.39 |
| 0 | Software Engineer | 141.967533 | New York, NY | 100.00 |
| 2 | Software Engineer | 137.799550 | Oakland, CA | 84.37 |
| 11 | Software Engineer | 135.184136 | Los Angeles, CA | 75.05 |
Now we have a way to predict your salary at a given level of experience, for a given role, in a specific location. Next, we want to figure out which city gives you the best bang for your buck: the one that pays the most while remaining cheap to live in.
To do this, we will perform the following simple calculation:
Predicted Base Salary ⁄ Cost of Living Index
# Bang for your buck: pass the dataframe you received from rank_cities()
def best_value(df):
    df["value"] = df["expectedbasesalary"]/df["costofliving"]
    return df.sort_values(by = ["value"], ascending = False)
test_best_value = best_value(test)
test_best_value.head()
| | title | expectedbasesalary | city | costofliving | value |
|---|---|---|---|---|---|
| 32 | Software Engineer | 110.172981 | Little Rock, AR | 45.73 | 2.409206 |
| 13 | Software Engineer | 122.068544 | Pittsburgh, PA | 57.59 | 2.119614 |
| 29 | Software Engineer | 109.478872 | Madison, WI | 51.99 | 2.105768 |
| 25 | Software Engineer | 103.054193 | Richmond, VA | 51.74 | 1.991770 |
| 33 | Software Engineer | 110.497438 | Austin, TX | 56.81 | 1.945035 |
Now this is interesting! Only one of the top 10 highest-paying cities, Pittsburgh, has a very good ratio of base salary to cost of living. But it isn't even first: the number one city in terms of salary relative to cost of living is Little Rock, whose modest predicted salary is paired with by far the lowest cost-of-living index on our list.
Based on our analysis and predictions, it seems Silicon Valley is not the best place to live in terms of salary and prices. Even though cities in the Bay Area do pay a lot of money, they are so expensive that it may not be worth moving there.
One thing to note is, because we narrowed down our data set to salaries from companies that appear many times, our data may be restricted to primarily large-scale private companies, as only they were able to provide sufficient information for analysis. It may be interesting to look deeper into other types of companies to see if the analysis changes.
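For anyone extending this, the frequency cutoff described above is straightforward to reproduce with `value_counts`. The toy rows and the threshold of 3 below are purely illustrative; the tutorial's actual cutoff would substitute:

```python
import pandas as pd

# Toy stand-in for the raw salary submissions (hypothetical rows)
toy = pd.DataFrame({"company": ["Google"] * 3 + ["Amazon"] * 2 + ["TinyStartup"]})

# Keep only companies with at least 3 submissions (illustrative threshold)
counts = toy["company"].value_counts()
frequent = counts[counts >= 3].index
filtered = toy[toy["company"].isin(frequent)]
print(filtered["company"].unique())
```

Lowering the threshold would admit smaller employers at the cost of noisier per-company averages, which is exactly the trade-off to explore if the analysis is rerun on startups and other small firms.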
Feel free to build on this tutorial! This is just one example of the data science pipeline; there's so much more to explore and analyze. We plan to use the regression analysis above as the backend for a salary and cost-of-living calculator, for which we will build a frontend UI.